Abstract

Three methods for the estimation of the reliability of single dichotomous items 
are discussed. All methods are based on the assumptions of nondecreasing and 
nonintersecting item response functions. Based on analytical and Monte Carlo 
studies, it is concluded that one method is superior to the other two, because it 
has a smaller bias and a smaller sampling variance. Furthermore, this method 
shows some robustness under violation of the condition of nonintersecting item 
response functions. Item reliability is of special interest for Mokken's 
nonparametric item response theory, and is useful for the evaluation of item 
quality in nonparametric test construction research. It is also of interest for 
nonparametric person fit analysis. 

Key words: item reliability, item response theory, Mokken model, nonparametric 
item response models, test construction. 

Introduction

For the practical use of tests, total scores are more important than scores on 
individual items. In test construction, however, the quality of items must be 
assessed to select the appropriate items that, taken together, constitute a useful 
test. For example, in classical test theory (CTT; Lord & Novick, 1968) item 
statistics like the p-value and the corrected item-total correlation are used for this 
purpose. In logistic item response theory (IRT; e.g., Lord, 1980) items can be 
evaluated on the basis of their difficulty, discrimination power, and 
pseudo-chance level. Moreover, the item information function (Lord, 1980, p. 72) 
can be used to assess measurement accuracy of the individual item. The 
nonparametric Mokken (1971, 1994; Mokken & Lewis, 1982) approach to IRT 
uses the p-value and an item scalability coefficient. 

Because the Mokken approach provides the theoretical framework for this 
study, we will further concentrate on its relevant assumptions and definitions. We 
will argue that in the Mokken IRT approach the reliability of an item can serve 
as a nonparametric counterpart of the discrimination power from logistic IRT and 
the corrected item-total correlation from CTT [refer to Lord (1980, p. 33) for a 
comparison of these latter two item statistics]. 

The purpose of this paper is to apply three relatively simple methods, used 
earlier for the estimation of total score reliability in the nonparametric Mokken 
IRT framework (Mokken, 1971, pp. 142 - 147; Sijtsma & Molenaar, 1987), to 
the estimation of single item reliability. The asymptotic bias and the finite sample 
bias of these methods will be investigated. Furthermore, their standard deviation 
will be studied, whereas results pertaining to skewness and kurtosis will be 
briefly summarized. 

The Nonparametric Mokken Approach and Item Reliability 

Nonparametric IRT models are important because of their potential to order 
persons and items. Cliff and Donoghue (1992) provide arguments that favor 
ordinal rather than interval measurement in psychological and educational testing. 
Mokken (1971, pp. 115 - 169; 1994; Mokken & Lewis, 1982) proposed two 
nonparametric IRT models for the analysis of binary item scores. The first is the 
model of monotone homogeneity (MH) which is defined by three assumptions: 
Unidimensionality, local stochastic independence, and nondecreasingness of the 
item response functions (IRFs). An important property of the MH model is that 
the latent trait score is stochastically ordered by the number-correct score on k 
items (Grayson, 1988; Huynh, 1994). Similar models were studied by Holland 
(1981), Rosenbaum (1984), Stout (1990), Ellis and van den Wollenberg (1993), 
and Junker (1993); other ordinal models by Schulman and Haden (1975) and 
Cliff (1979). 

The second model is the model of double monotonicity (DM). This model 
rests on the same three assumptions as the MH model, plus the fourth assumption 
that the IRFs do not intersect. Thus, the DM model not only allows persons to be 
ordered, but also allows an ordering of items that is identical, except for possible 
ties, for all persons taking the test. Similar models were discussed by Rosenbaum 
(1987), Croon (1991), Sijtsma and Meijer (1992), and Sijtsma and Junker (1994). 



It may be noted that the Rasch (1960) model is based on the three 
assumptions from the MH model, plus the fourth assumption of minimal 
sufficiency of the number-correct scores of persons and items for the estimation 
of the latent person and item parameters, respectively (Fischer, 1974, pp. 193 - 
203). Not only are the IRFs from the Rasch model strictly increasing and 
nonintersecting but they are also parallel. Levine (1970) has discussed conditions 
from which it can be derived that, in general, DM IRFs cannot be transformed 
into Rasch IRFs. For example, the DM model allows IRFs with asymptotes that 
are unequal to 0 or 1 whereas the Rasch model excludes such IRFs. Disregarding 
the trivial case of constant IRFs, theoretically, the DM model includes the Rasch 
model as a special case. In practice, however, differences will become apparent 
in particular for small numbers of items. For larger numbers the DM model still 
allows relatively easy items to have pseudo-chance levels larger than 0 and 
relatively difficult items to have upper asymptotes smaller than 1. This is not at 
all unrealistic because easy items may also be relatively easy for low-ability 
examinees, even if there is no guessing, while difficult items need not be trivial 
for high-ability examinees. Meijer, Sijtsma, and Smid (1990) provide a theoretical 
and a practical comparison of the DM and the Rasch model. 

Note that with their nonparametric definition of the IRFs, the MH and the 
DM models do not assume particular distributions for latent model parameters. In 
other words, characteristics of the models hold irrespective of such distributions. 
As a result of a nonparametric definition, latent item parameters from parametric 
models, such as the item difficulty and the discrimination power, cannot be 
numerically estimated. In the nonparametric approach by Mokken (1971; Mokken 
& Lewis, 1982), the latent item difficulty is replaced by the proportion of correct 
responses given on an item (Mokken, 1971, p. 124). Furthermore, Mokken (1971, 
p. 151; Mokken & Lewis, 1982) proposed an item coefficient that expresses the 
scalability of a particular item with respect to the scale of the other items. 
Mokken, Lewis, and Sijtsma (1986) noted that this coefficient is also related to 
the slope of an IRF. In addition, Donoghue and Cliff (1991) noted that the 
Mokken approach does not provide much specific information at the item level. 
An item statistic that is more directly related to the discrimination power could 
be useful in item selection. Such a statistic can also play a useful role in 
nonparametric person fit analysis (e.g., Meijer, Molenaar, & Sijtsma, 1993; 
Tatsuoka & Tatsuoka, 1983; van der Flier, 1982). In this study, item reliability is 
proposed as an appropriate replacement for discrimination power [also refer to 
Meredith (1965) for a similar proposal] in a nonparametric IRT context. This can 
be explained as follows. 

The reliability of an item expresses the degree to which observed item 
scores can be repeated independently under similar conditions. Discrimination 
power (denoted by $\alpha$) as defined in logistic IRT (Lord, 1980) has a similar 
interpretation. Let $\theta$ be the latent person parameter with probability density 
$f(\theta)$. Furthermore, let item g have a latent difficulty parameter $\delta_g$ and a 
latent discrimination parameter $\alpha_g$. Keeping $f(\theta)$ and $\delta_g$ fixed, an 
increase in $\alpha_g$ corresponds to a higher degree of repeatability of observed 
scores on item g. In the limit ($\alpha_g \to \infty$), response performance is in 
accordance with the deterministic Guttman (1950) model: this means perfect 
repeatability and thus perfect item reliability. For response behavior following a 
logistic IRT model an increase in $\alpha_g$ yields lower probabilities of a correct 
response to the left of $\delta_g$ and higher probabilities to the right of it. 
Consequently, for each subject with $\theta \neq \delta_g$ his/her dominant item response 
(which is incorrect for $\theta < \delta_g$ and correct for $\theta > \delta_g$) can be 
predicted with higher probability. Note that for $\theta = \delta_g$ the success 
probability is a constant irrespective of $\alpha_g$. In other words, holding 
everything else constant, an increase in $\alpha_g$ corresponds to a higher degree of 
repeatability of item scores. 
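
The relation between discrimination and repeatability can be checked numerically. The following is a minimal sketch, not part of the original study: it assumes a two-parameter logistic IRF, a standard normal $\theta$, a simple grid integration, and the item reliability as defined in (1) in the next section. The function names (`irf`, `expect`, `item_reliability_2pl`) are ours.

```python
import math

def irf(theta, alpha, delta):
    """Two-parameter logistic IRF: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - delta)))

def expect(func, lo=-8.0, hi=8.0, steps=3200):
    """Approximate E[func(theta)] for standard normal theta on an even grid."""
    h = (hi - lo) / steps
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        total += func(t) * norm * math.exp(-0.5 * t * t)
    return total * h

def item_reliability_2pl(alpha, delta):
    """Item reliability for one 2PL item, using two independent replications."""
    pi_g = expect(lambda t: irf(t, alpha, delta))
    pi_gg = expect(lambda t: irf(t, alpha, delta) ** 2)
    return (pi_gg - pi_g ** 2) / (pi_g * (1.0 - pi_g))

# Difficulty held fixed at 0; only the discrimination changes.
for alpha in (0.5, 1.0, 2.0, 5.0):
    print(alpha, round(item_reliability_2pl(alpha, 0.0), 3))
```

Under these assumptions the printed reliabilities increase with $\alpha$, in line with the argument above.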



Definition and Estimation of Item Reliability 



Because the theoretical basis for the definition and the estimation of item 
reliability was given by Mokken (1971, pp. 142 - 147), and Sijtsma and 
Molenaar (1987), we will only provide results here. Let $\pi_g$ be the population 
proportion of persons giving a correct response on the dichotomous item g, and 
$\pi_{gg}$ the population proportion giving a correct response on two locally 
independent replications of item g. As a tool for estimating the reliability of a 
test score, Mokken (1971, p. 143) defines the reliability of the dichotomous item 
score $X_g$ as 

$$\rho(X_g) = \frac{\pi_{gg} - \pi_g^2}{\pi_g (1 - \pi_g)} . \tag{1}$$

Reliability equal to 0 is obtained if $\pi_{gg} = \pi_g^2$ (statistical independence 
between replications of item g); reliability equal to 1 if $\pi_{gg} = \pi_g$. 
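
As a small illustration of definition (1) and its two boundary cases, the sketch below evaluates the reliability for given population proportions. The function name and the numerical values are hypothetical and serve only as an example.

```python
def item_reliability(pi_g, pi_gg):
    """Item reliability as in (1): (pi_gg - pi_g^2) / (pi_g * (1 - pi_g))."""
    if not 0.0 < pi_g < 1.0:
        raise ValueError("pi_g must lie strictly between 0 and 1")
    return (pi_gg - pi_g ** 2) / (pi_g * (1.0 - pi_g))

# Independent replications: pi_gg equals pi_g squared, so reliability is 0.
print(item_reliability(0.5, 0.25))
# Perfectly repeatable scores: pi_gg equals pi_g, so reliability is 1.
print(item_reliability(0.5, 0.5))
# An intermediate, made-up case.
print(item_reliability(0.6, 0.45))
```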

The proportion $\pi_g$ can be estimated unbiasedly (Mokken, 1971, p. 126) but 
because locally independent replications of items usually are absent, a direct 
estimate of $\pi_{gg}$ is not available. Therefore, Mokken (1971, p. 143) proposed two 
methods using parameters for which sample estimators are available to 
approximate $\pi_{gg}$. Sijtsma and Molenaar (1987) proposed a third method. All 
three methods are based on extrapolation or interpolation using items adjacent to 
item g in the ordering of items from difficult to easy. The rationale is the 
following. 

Assume that the k items from the test are ordered according to increasing 
$\pi_g$ and that item indices are in accordance with this ordering. Let the IRFs, 
denoted by $\pi_g(\theta)$ (g = 1, ..., k), of all k items be nonintersecting: for 
items g-1, g, and g+1 this means that 

$$\pi_{g-1}(\theta) \leq \pi_g(\theta) \leq \pi_{g+1}(\theta), \quad \text{for all } \theta . \tag{2}$$

Based on the idea that the IRFs of the neighbor items in the item ordering are 
more similar to $\pi_g(\theta)$ than the other IRFs, all three methods use either 
$\pi_{g-1}(\theta)$ or $\pi_{g+1}(\theta)$, or both, as a predictor of a real replication of 
item g. Note that $\pi_{gg}$ equals 

$$\pi_{gg} = \int_\theta \pi_g(\theta)\, \pi_g(\theta)\, dF(\theta) . \tag{3}$$

Before integrating with $dF(\theta)$, one of the probabilities $\pi_g(\theta)$ is replaced 
by a linear approximation using one or two of its neighbors, $\pi_{g-1}(\theta)$ or 
$\pi_{g+1}(\theta)$, or both: $\tilde{\pi}_g(\theta) = a + b\,\pi_{g-1}(\theta) + c\,\pi_{g+1}(\theta)$. 
Each method is defined by the choice of a, b, and c. Substitution of 
$\tilde{\pi}_g(\theta)$ in (3) and integration yield 

$$\tilde{\pi}_{gg} = a\,\pi_g + b\,\pi_{g-1,g} + c\,\pi_{g,g+1} . \tag{4}$$



In (4), $\pi_{g-1,g}$ is the population proportion of persons that have correct 
responses on both items g-1 and g. A similar definition applies to $\pi_{g,g+1}$. 
Mokken's (1971, [...] and $\pi_{g,g+1}$. Sijtsma and Molenaar (1987) provided two 
alternative approximations to $\pi_{gg}$. Because each of these four approximations is 
asymptotically biased (Molenaar & Sijtsma, 1984), Sijtsma and Molenaar's 
(1987) method used the unweighted mean of these four approximations which has 
only small bias. 

Mokken's method 2 uses both neighbors of item g to approximate $\pi_{gg}$ by 
interpolation (Mokken, 1971, p. 147). For the two extreme items extrapolation 
(method 1) is used. Refer to Sijtsma and Molenaar (1987) for further details. 
Note that these earlier publications only give results pertaining to sample bias and 
variance of total score reliability estimation for each of the three reliability 
methods: $\rho_1$, $\rho_2$, and $\rho_{MS}$. 

All approximations to $\pi_{gg}$ are functions of the bivariate proportions and 
the distance between item difficulties. If a bivariate proportion is smaller or a 
distance is larger than expected if the items had been replications, this may bias 
$\tilde{\pi}_{gg}$ and, consequently, the reliability estimate of item g. Figure 1 
illustrates the effect of distance on the approximation of $\pi_g(\theta)$ by means of 
one neighbor (method 1; left in Figure 1) or two neighbors (method 2; right in 
Figure 1). To illustrate estimation of $\pi_{gg}$ using method 1 we need the 
extrapolation formula (Mokken, 1971, p. 147) 

$$\tilde{\pi}_{gg} = \frac{\pi_g}{\pi_{g+1}}\, \pi_{g,g+1} . \tag{5}$$



For method 1, the striped curve denoted $\tilde{\pi}_g(\theta)$ in Figure 1 (left) 
gives the approximation to $\pi_g(\theta)$ using $(\pi_g/\pi_{g+1})\,\pi_{g+1}(\theta)$ [note 
that substitution of this product in (3) yields (5)]. Assume that $\theta$ follows a 
normal distribution with its peak at the scale value for which $\pi_g(\theta) = .5$. The 
approximation on the basis of $\pi_{g+1}(\theta)$ overestimates $\pi_g(\theta)$ to the left of 
the scale, but it underestimates $\pi_g(\theta)$ to the right of it. As it is multiplied 
by the factor $\pi_g(\theta)\,dF(\theta)$, higher values of $\theta$ tend to contribute most to 
the integral that yields the approximation to $\pi_{gg}$ in (5). The underestimation 
thus tends to dominate the overestimation. A larger distance usually results in a 
worse approximation. If $\pi_{g+1}(\theta) - \pi_g(\theta)$ increases while keeping 
$\pi_g(\theta)$ fixed, the multiplication factor $\pi_g/\pi_{g+1}$ in (5) decreases and the 
approximation to $\pi_g(\theta)$ lies further to the left of $\pi_g(\theta)$ and also further 
below it at the right side of the scale. Thus, it tends to underestimate 
$\pi_g(\theta)$ more strongly if the distance is larger. The same line of reasoning 
leads to the conclusion that the approximation based on $\pi_{g-1}(\theta)$ (formula not 
given here) tends to overestimate $\pi_g(\theta)$ and, as a result, $\tilde{\pi}_{gg}$ more 
strongly overestimates $\pi_{gg}$ if the curves $\pi_{g-1}(\theta)$ and $\pi_g(\theta)$ lie 
further apart. 
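
The general form (4) and its special case (5) can be written out as a short sketch. The function names and the three proportions used below are hypothetical illustrations, not values from this study; extrapolation from the easier neighbor g+1 corresponds to a = 0, b = 0, and c = $\pi_g/\pi_{g+1}$ in (4).

```python
def approx_pi_gg(pi_g, pi_prev_g, pi_g_next, a, b, c):
    """Equation (4): a*pi_g + b*pi_{g-1,g} + c*pi_{g,g+1}."""
    return a * pi_g + b * pi_prev_g + c * pi_g_next

def method1_from_next(pi_g, pi_next, pi_g_next):
    """Equation (5): (4) with a = 0, b = 0, c = pi_g / pi_{g+1}.
    The bivariate proportion pi_{g-1,g} is not needed, so 0.0 is passed."""
    return approx_pi_gg(pi_g, 0.0, pi_g_next, 0.0, 0.0, pi_g / pi_next)

def item_reliability(pi_g, pi_gg):
    """Equation (1)."""
    return (pi_gg - pi_g ** 2) / (pi_g * (1.0 - pi_g))

# Made-up population proportions for item g and its easier neighbor g+1.
pi_g, pi_next, pi_g_next = 0.50, 0.70, 0.39

pi_gg_tilde = method1_from_next(pi_g, pi_next, pi_g_next)
rho_tilde = item_reliability(pi_g, pi_gg_tilde)
print(pi_gg_tilde, rho_tilde)
```

Any other choice of a, b, and c can be passed to `approx_pi_gg` in the same way.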

For method 2 (Figure 1, right; formula not given here), the underestimation 
at the right of the scale obtains a larger weight than the overestimation at the left, 
and $\tilde{\pi}_{gg}$ according to method 2 tends to be an underestimate. Moving 
$\pi_{g-1}(\theta)$ further to the right while keeping $\pi_g(\theta)$ and $\pi_{g+1}(\theta)$ 
fixed, thus increasing inequality of distances, leads to a situation in which it is 
difficult to predict how the bias of $\tilde{\pi}_{gg}$ will be affected. 

These examples lead to the conclusion that distance affects the degree to 
which $\tilde{\pi}_{gg}$ is biased, and unequal distances of both neighbors to 
$\pi_g(\theta)$ affect the bias differently than equal distances. Given the 
susceptibility of the item 
reliability methods to the quality of other items in the test, it will be investigated 
which of the three methods has the smallest bias. 

An alternative approach would be the use of the m (m > 2) nearest 
neighbors to approximate $\pi_{gg}$. However, neighbors that are farther away are less 
similar (in the sense of replications) to item g than the two nearest neighbors. 
Thus, we would expect larger bias in estimating $\pi_{gg}$ for m > 2. By using more 
information from the data, however, the sampling variance of the estimates might 
decrease compared with m = 2. An acceptable compromise between bias and 
accuracy would, probably, depend on several characteristics of test, items, and 
population. Also refer to Donoghue and Cliff (1991) and Cliff and Donoghue 
(1992) who use ordinal multiple regression for a related problem in ordinal true 
score theory. Rather than pursuing a more complex strategy, we will stay within 
the confines of the Mokken approach and investigate asymptotic and sampling 
characteristics of reliability estimators based on the simpler methods 1, 2, and 
MS. Only if none of these methods yields satisfactory results may a more 
complex strategy be rewarding. 

An analytical derivation of the distribution properties of the three methods 
is not pursued because the ordering of the items according to their difficulty may 
well vary across random samples and different approximations to $\pi_{gg}$ will be 
used. Therefore, conclusions will be based on simulation studies. 

Asymptotic Bias in Item Reliability Estimation Methods 



Method. As a first step, the bias of each of the three item reliability 
methods with respect to $\rho(X_g)$ in (1) was investigated using population fractions 
obtained via numerical integration across the ability distribution. This made it 
possible to study the performance of the three methods in the ideal case of very 
large samples. Throughout this study sets of 7 items were used. Such a small set 
was large enough because (1) the focus of attention was on the individual item; (2) 
distance between items could be manipulated equally well in small and large sets; 
(3) differences between extremely located items and items in between could be 
studied independently of test length; and (4) with usually smaller distances 
between adjacent items in longer tests, results for smaller tests were expected to 
be conservative. Furthermore, logistic IRFs were used. Note, in particular, that 
although our theoretical framework is nonparametric IRT, parametrically defined 
IRFs and parameter distributions are necessary to simulate 0's and 1's. 

Given 7 two-parameter logistic IRFs and a standard normal distribution of 
$\theta$, numerical integration (IMSL routine DCADRE, 1982) was used to obtain the 
population proportions $\pi_g$ (g = 1, ..., 7), $\pi_{gg}$ (g = 1, ..., 7) and $\pi_{gh}$ 
(g, h = 1, ..., 7; g $\neq$ h). Using $\pi_g$ and $\pi_{gg}$, the item reliability 
$\rho(X_g)$ was calculated. To calculate item reliability with approximation methods 
1, 2 and MS, the proportions $\pi_g$ and $\pi_{gh}$ were used: the results are denoted 
by $\rho_1$, $\rho_2$, and $\rho_{MS}$, respectively. The difference between each of these 
parameters and $\rho(X_g)$ equals the bias of a specific method with respect to the 
reliability (1) for item g (g = 1, ..., 7). 
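
The computation of an asymptotic bias can be sketched as follows for a single item. This is an illustration under our own assumptions, not a reproduction of the study: a plain grid integration replaces the IMSL routine, only the extrapolation from the easier neighbor in (5) is shown, and the two item locations (0 and -.5, with discrimination 1) are chosen by us.

```python
import math

def irf(theta, alpha, delta):
    """Two-parameter logistic IRF."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - delta)))

def expect(func, lo=-8.0, hi=8.0, steps=3200):
    """Approximate E[func(theta)] for standard normal theta on an even grid."""
    h = (hi - lo) / steps
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        total += func(t) * norm * math.exp(-0.5 * t * t)
    return total * h

def item_reliability(pi_g, pi_gg):
    """Equation (1)."""
    return (pi_gg - pi_g ** 2) / (pi_g * (1.0 - pi_g))

alpha = 1.0                       # assumed common discrimination
delta_g, delta_next = 0.0, -0.5   # item g and its easier neighbor g+1

pi_g = expect(lambda t: irf(t, alpha, delta_g))
pi_next = expect(lambda t: irf(t, alpha, delta_next))
pi_gg = expect(lambda t: irf(t, alpha, delta_g) ** 2)
pi_g_next = expect(lambda t: irf(t, alpha, delta_g) * irf(t, alpha, delta_next))

rho_true = item_reliability(pi_g, pi_gg)
rho_m1 = item_reliability(pi_g, pi_g_next * pi_g / pi_next)  # uses (5)
bias_m1 = rho_m1 - rho_true
print(rho_true, rho_m1, bias_m1)
```

In this configuration the extrapolated value falls below the true reliability, which agrees with the direction of bias argued for formula (5).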

A completely crossed 4 × 2 × 3 design was used. The first factor was 
average discrimination power $\alpha_M$ (subscript M denotes mean), with four levels: 
$\alpha_M$ = .5, 1, 2, and 5. In combination with a standard normal distribution of 
$\theta$ these values cover the complete range from very weak to very strong 
discrimination (Meijer et al., 1993). The second factor was spread of the $\alpha$s 
within one test, with two levels: zero spread (all 7 $\alpha$s equal) and positive 
spread ($\alpha$s unequal). Zero spread corresponds to nonintersection of 
two-parameter logistic IRFs. For example, for $\alpha_M = 1$ we have $\alpha_g = 1$ 
(g = 1, ..., 7). Positive spread corresponds to intersection of IRFs, and thus 
provides a violation of a condition underlying estimation of item reliability. For 
example, for $\alpha_M = 1$ we have $\alpha$ = (1.3, 1, 1, .7, 1, 1.3, .7). This more 
realistic condition allows us to investigate the robustness of the estimation 
methods. The third factor was distance between item locations. A distinction was 
made between sets of equally spaced items and sets of unequally spaced items. 
Three levels were distinguished. On two levels, item locations ($\delta$s) were 
equidistant with median equal to zero and distance [$d(\delta)$] equal to either .1 or 
.5, respectively. These levels are denoted ES (Equidistant, Small distance) and EL 
(Equidistant, Large distance), respectively. On the third level, $d(\delta)$ varied 
more realistically within one item set. In particular, $\delta$ = (-.4, -.3, -.2, 0, .2, 
.8, 1.6) for all design cells on this level. The third level is denoted UD (Unequal 
Distance). 
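
For concreteness, the design cells and the item locations on each distance level can be enumerated as in the sketch below. The names are ours, and the order in which locations are assigned to item indices is not fixed here. The $\alpha$ vectors for positive spread are not generated, because the text gives them only for $\alpha_M = 1$.

```python
from itertools import product

ALPHA_MEANS = (0.5, 1.0, 2.0, 5.0)   # average discrimination
SPREADS = ("zero", "positive")        # spread of the alphas
DISTANCES = ("ES", "EL", "UD")        # distance between item locations

def item_locations(level, k=7):
    """Item locations for one distance level, with median equal to zero."""
    if level == "UD":
        return [-0.4, -0.3, -0.2, 0.0, 0.2, 0.8, 1.6]
    step = {"ES": 0.1, "EL": 0.5}[level]
    mid = (k - 1) / 2
    # round() removes floating point noise such as -0.30000000000000004
    return [round((i - mid) * step, 10) for i in range(k)]

cells = list(product(ALPHA_MEANS, SPREADS, DISTANCES))
print(len(cells))                       # number of design cells
print(item_locations("ES"))
print(item_locations("EL"))

# Cells with zero spread (nonintersecting IRFs) and their item count.
zero_spread_cells = [c for c in cells if c[1] == "zero"]
print(len(zero_spread_cells), len(zero_spread_cells) * 7)
```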

Results. Table 1 shows a summary of the asymptotic bias results for the 
complete design. For Nonintersecting IRFs (left half of Table 1) the 
general conclusion for method MS is that the reliability is almost unbiased for 
most items (results denoted by #nobi, number of items having "no bias"). Out of 
84 reliabilities (12 cells), 70 have a bias smaller than |.01|, and 75 have a bias 
smaller than |.03|. The largest bias (denoted min, for $\alpha_M = 5$ and UD) is -.06. 
The results for #nobi, min and max are almost always better for method MS than 
for methods 1 and 2. These latter methods often yield unacceptably large biases, 
for example, larger than |.10|. Method 1 often has a large bias for most of the 7 
items in the test. Method 2 mostly yields large biases for the two extreme items 
(for which, in fact, method 1 is used) and sometimes also for the items in 
between. 

For Intersecting IRFs (right half of Table 1) asymptotic bias is larger for all 
three methods. For method MS, 25 of the 84 reliabilities have a bias smaller than 
|.01|, and 53 have a bias smaller than |.03|. The largest bias is -.07 (min for 
$\alpha_M = 5$ and UD). As for Nonintersecting IRFs, the results for Intersecting IRFs 
are almost always better for method MS than for the other two methods. With a few 
exceptions (not all individual values are shown in Table 1), the bias of method 
MS for individual item reliabilities is acceptable. 

For method MS, the grand mean of the bias is .001. Main effects and 
interaction effects are mostly very close to 0 (< |.01|), with one exception for 
$\alpha_M = 5$ and EL (first-order interaction is -.03). 

It can be concluded for method MS that: (1) bias is smaller than for 
methods 1 and 2; (2) bias is often negligible or practically acceptable; and (3) 
bias stays within reasonable limits even if IRFs intersect. 

Finite Sample Estimation of Item Reliability 

Method . A Monte Carlo study was conducted to assess the sampling 
characteristics of the three approximations to item reliability for realistic sample 
sizes. Despite the larger asymptotic biases for method 1 and method 2 (Table 1), 
they were also subjected to the Monte Carlo investigation because: (1) not only 
bias is important but also sampling variance; (2) it could well happen that a 
method with larger asymptotic bias has smaller finite sample bias, given e.g. the 
additional problem of different neighbors mentioned above; and (3) methods 1 
and 2 are simpler than method MS and might thus be recommended if the bias of 
method MS is only slightly larger. 

Data matrices containing n (persons) × k (items) binary item scores were 
generated (for the simulation procedure see Sijtsma & Molenaar, 1987) using 
two-parameter logistic IRFs and a standard normal distribution of $\theta$. The design 
from the asymptotic bias study was extended by adding sample size as a fourth 
factor with three levels: n = 100, 300, and 900. The sample size n = 100 can be 
considered to be typical of ad hoc test construction that is part of a larger 
research project, n = 300 of test construction research as performed in a 
noncommercial environment (e.g., universities, where the means to collect data from 
larger samples are limited), and n = 900 (or more) of large scale test construction 
on a more commercial basis. 

Thus a completely crossed 4 × 2 × 3 × 3 design was used. The number of 
replications in each cell of the design was 200. For each replication, the 
estimated $\pi_g$ and $\pi_{gh}$ were used (in the order found from that matrix) for 
estimation of $\rho$ by methods 1, 2, and MS. 
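
One replication of such a Monte Carlo step might look as follows. This is a sketch under our own assumptions (the procedure of Sijtsma & Molenaar, 1987, is not reproduced here): $\theta$ is drawn from a standard normal distribution, each score is a Bernoulli draw from a two-parameter logistic IRF, and items are then ordered by their sample proportions. All names and the seed are hypothetical.

```python
import math
import random

def simulate_scores(n, alphas, deltas, rng):
    """Generate an n x k matrix of binary item scores under the 2PL model."""
    data = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        row = []
        for a, d in zip(alphas, deltas):
            p = 1.0 / (1.0 + math.exp(-a * (theta - d)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

def sample_proportions(data):
    """Univariate proportions p[g] and bivariate proportions p2[g][h]."""
    n, k = len(data), len(data[0])
    p = [sum(row[g] for row in data) / n for g in range(k)]
    p2 = [[sum(row[g] * row[h] for row in data) / n for h in range(k)]
          for g in range(k)]
    return p, p2

rng = random.Random(1)                            # fixed seed, for repeatability
alphas = [1.0] * 7                                # zero spread, mean alpha 1
deltas = [0.3, 0.2, 0.1, 0.0, -0.1, -0.2, -0.3]   # equidistant, small distance
data = simulate_scores(300, alphas, deltas, rng)
p, p2 = sample_proportions(data)

# Item order as found in this sample, from difficult to easy.
order = sorted(range(len(p)), key=lambda g: p[g])
print(order)
print([round(p[g], 3) for g in order])
```

Note that the diagonal entry `p2[g][g]` simply equals `p[g]` and is not an estimate of $\pi_{gg}$, which is why the approximation methods rely on neighboring items.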

Results. Method MS almost always has a smaller finite sample bias than 
methods 1 and 2. In addition, for practical purposes the bias of method MS can 
be ignored. Furthermore, the standard deviation of method MS is almost always 
smaller than that of methods 1 and 2 (not tabulated here). Because of these 
results only the Monte Carlo results for method MS are discussed. 

In Table 2 (results for bias and standard deviation for n = 300), it can be 
seen that method MS is almost unbiased. For the widely spaced items (Table 2, EL) 
that have Nonintersecting IRFs (Table 2, first half) the bias is, except for 
$\alpha_M = 5$, somewhat larger for the extreme items. For $\alpha_M = 5$, the bias is 
larger for the items in between. For unequally spaced items (Table 2, UD), bias is 
negligible save a few exceptions if $\alpha_M = 2$ and $\alpha_M = 5$. For n = 100 (not 
tabulated), bias is in general somewhat higher, especially for $\alpha_M = .5$ and 
$\alpha_M = 1$. For n = 900 (not tabulated), bias results are highly comparable to the 
results obtained for n = 300. 

For n = 300 and Nonintersecting IRFs (Table 2, first half), the standard 
deviation for almost all items is approximately .05. Only the standard deviation 
for the extremely easy and difficult items from widely spaced sets of items 
(Table 2, EL) sometimes is somewhat larger. For small sample size (n = 100; not 
tabulated), the standard deviation of method MS across samples is rather large 
(between .07 and .13 for the extreme items and between .04 and .09 for the items 
in between). For large sample size (n = 900; not tabulated), the standard 
deviation for almost all items is approximately .025. In general, for n = 100 the 
standard deviation is approximately $\sqrt{3}$ times as large as for n = 300, and for 
n = 900 it is approximately $\sqrt{3}$ times as small as for n = 300. 
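
The factor $\sqrt{3}$ is what one would expect if the standard deviation of the estimator is roughly proportional to $1/\sqrt{n}$. That proportionality is an assumption made here for illustration (the usual large-sample behavior), not a result derived in this study. Under it,

$$\frac{SD_{100}}{SD_{300}} = \sqrt{\frac{300}{100}} = \sqrt{3} \approx 1.73, \qquad \frac{SD_{900}}{SD_{300}} = \sqrt{\frac{300}{900}} = \frac{1}{\sqrt{3}} \approx .58 ,$$

so a standard deviation of about .05 at n = 300 would correspond to roughly .087 at n = 100 and roughly .029 at n = 900.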

The results for the third and fourth moments (not tabulated) are briefly 
summarized. For $\alpha_M = 1$ and $\alpha_M = 2$ the distribution of estimator MS is 
rather symmetrical around its mean (all sample sizes; skewness between 
-.4 and .4). For $\alpha_M = .5$, for some items the distribution is positively skewed. 
For $\alpha_M = 5$ (all sample sizes), the distribution is negatively skewed for some 
items and positively skewed for others. The peakedness of the distribution is 
more or less comparable with the normal distribution for all discrimination levels 
(in general, the kurtosis is approximately 3). 

If IRFs intersect (Table 2, second half), the bias of method MS is generally 
larger than if IRFs do not intersect (Table 2, first half). The pattern of biases 
across items within a test is rather inconsistent. For a few items bias ranges 
from -.08 to .05. However, for the majority of the items bias is much smaller. 
Compared with nonintersection of IRFs (Table 2, first half), standard deviation 
results are approximately the same if IRFs intersect (Table 2, second half). The 
same conclusion holds for skewness and kurtosis results. 

The grand mean of the bias is -.001 for the results pertaining to n = 300. 
The vast majority of main and interaction effects is smaller than |.01|. A few 
exceptions occur for some first, second, and third order interactions (effects 
smaller than |.02| in most cases; never larger than |.03|) for $\alpha_M = 5$. ANOVA 
results for the standard deviation show no interesting effects. 

Discussion 

This study has introduced and compared three methods (method 1, method 
2, and method MS) for the estimation of the item reliability that are based on the 
Mokken model of double monotonicity. Method MS was unbiased for almost all 
items with the exception of a small and probably unimportant bias for the 
extreme items if item difficulties are widely spaced. In addition, in all cases 
studied, method MS had smaller bias than the other two reliability methods. 
Reduction of the bias of method MS for the extreme items seems problematic, 
because the use of, for example, the m (m > 2) nearest neighbor items rather than 
the two nearest neighbors would probably reduce the standard deviation but 
increase the bias, in particular for the two extreme items. 

Method MS had the smallest standard deviation across random samples. For 
a sample size n = 300, its standard deviation ranged from .03 to .06. For small 
samples (n = 100), on the average the standard deviation was larger by a factor 
of approximately $\sqrt{3}$ and for larger samples (n = 900) it was smaller by the same 
factor. It may be concluded that for n = 300 and larger samples reliability 
estimates are accurate enough to allow the identification of unreliable items. For 
small samples (n = 100) accuracy may be too small, but it may be noted that 
such samples are generally considered to be too small for serious test 
construction and only allow tentative conclusions about the quality of a test and 
its items. Finally, other results indicated that the sampling distribution of method 
MS is approximately symmetrical in most situations that were considered here. 

Figure Caption 

Three IRFs with $\pi_{g+1} = .697$, $\pi_g = .500$, $\pi_{g-1,g} = .162$, and 
$\pi_{g,g+1} = .420$, illustrating the approximation of $\pi_g(\theta)$ by means of 
method 1 (left; dashed curves) and method 2 (right; dashed curves). Proportions 
based on $\delta_{g+1} = -1$, $\delta_g = 0$, $\delta_{g-1} = 1.5$, $\alpha = 1$ for all three 
items and $\theta$ standard normally distributed. 